Human genetic variation and its 
contribution to complex traits 

Kelly A. Frazer, Sarah S. Murray, Nicholas J. Schork and Eric J. Topol 

Abstract | The last few years have seen extensive efforts to catalogue human genetic 
variation and correlate it with phenotypic differences. Most common SNPs have now 
been assessed in genome-wide studies for statistical associations with many complex 
traits, including many important common diseases. Although these studies have provided 
new biological insights, only a limited amount of the heritable component of any complex 
trait has been identified and it remains a challenge to elucidate the functional link 
between associated variants and phenotypic traits. Technological advances, such as the 
ability to detect rare and structural variants, and a clear understanding of the challenges 
in linking different types of variation with phenotype, will be essential for future progress. 



Structural variants 
Broadly defined, these are all 
variants that are not single 
nucleotide variants. They 
include insertion-deletions, 
block substitutions, inversions 
of DNA sequences and copy 
number differences. 

Genome-wide association 
(GWA) study 

An investigation of the 
association between common 
genetic variation and disease. 
This type of analysis requires 
a dense set of markers (for 
example, SNPs) that capture a 
substantial proportion of 
common variation across the 
genome, and large numbers of 
study subjects. 
Elucidating the inherited basis of genetic variation in 
human health and disease is one of the major scientific 
challenges of the twenty-first century. In 2001 two reference 
versions of the human genome were published. 
One was released by the Human Genome Sequencing 
Consortium and reflected the assembly of sequences 
derived from numerous donors 1 , whereas the other, 
released by Celera Genomics, was a consensus sequence 
derived from five individuals 2 . Importantly, both ver- 
sions represented the human genome as a haploid 
sequence and genetic variation was not annotated. In 
order to study how genetic variants contribute to phe- 
notypic diversity, large-scale studies were initiated to 
identify and catalogue nucleotides that differ among 
individuals. Initial studies focused largely on understanding 
the range of patterns and frequencies of SNPs 3-5 . 
As the prevalence and contribution of structural variants 
to human biology was realized 6,7 , consortia were formed 
and systematic studies were conducted to improve our 
understanding of this class of variants 8-10 . 

In 2007, the first complete genome sequence of an 
individual, J. Craig Venter 11 , was published, followed 
shortly thereafter by the publication of a second individual's 
genome, that of James D. Watson 12 . Subsequently, two 
additional genomes from anonymous individuals were 
sequenced: one Han Chinese (Asian) 13 and one Nigerian 
(African) 14 . In aggregate, these studies — published after 
the release of the human genome reference sequence 
— have rapidly increased our knowledge of the various 
forms of human genetic variation, their evolutionary 
histories and the correlations between them. However, 
our understanding of the locations and frequencies of 



structural variants across the genome is still limited, and 
cataloguing these classes of alterations is a high priority. 

Genome-wide association (GWA) studies are the most 
widely used contemporary approach to relate genetic 
variation to phenotypic diversity 15 . Over the past 
2 years these studies have identified statistical associa- 
tion between hundreds of loci across the genome and 
common complex traits. The results of these studies have 
substantially increased our understanding of the diverse 
molecular pathways underlying specific human diseases. 
However, GWA studies have several limitations. First, 
there is great difficulty moving beyond mere statistical 
associations to identifying the functional basis of the 
link between a genomic interval and a given complex 
trait. Second, SNP associations identified in one popula- 
tion frequently are not transferable to members of other 
populations. Third, the bulk of the heritable fraction of 
complex traits has not been accounted for in recent GWA 
studies. This last point is probably explained by the fact 
that GWA studies do not capture information about rare 
variants and have limited statistical power to detect small 
gene-gene and gene-environment interactions. 

The use of new technologies for assaying DNA 
sequences has provided important insights and raised 
new questions about the roles that different types of 
genetic variants have in human health and disease. Here, 
for each type of genetic variant we discuss their probable 
contribution to overall genetic variation, the approaches 
taken to assess their contribution to phenotypic variation 
and the successes achieved so far. There have been sev- 
eral excellent reviews on structural variation 16,17 as well 
as reviews describing the findings of GWA studies 15,18-20 . 



NATURE REVIEWS | GENETICS 



VOLUME 10 | APRIL 2009 | 241 



© 2009 Macmillan Publishers Limited. All rights reserved 



REVIEWS 



[Figure 1 panels: Single nucleotide variant | Insertion-deletion variant | Block substitution | Inversion variant | Copy number variant]

Figure 1 | Classes of human genetic variants. The nomenclature used to describe the 
various types of structural variants is not yet standard 121 . Here, the terminology used 
aims to describe the nucleotide composition of the variant and distinguish it from other 
types of variants. Single nucleotide variants are DNA sequence variations in which a 
single nucleotide (A, T, G or C) is altered. Insertion-deletion variants (indels) occur when 
one or more base pairs are present in some genomes but absent in others. They are 
generally composed of only a few bases but can be greater than 80 kb in length 11 . Block 
substitutions describe cases in which a string of adjacent nucleotides varies between 
two genomes. An inversion variant is one in which the order of the base pairs is reversed 
in a defined section of a chromosome. A well-characterized inversion variant that has 
been described in humans involves a section of chromosome 17 in which a ~900 kb 
interval is in the reverse order in approximately 20% of individuals with Northern 
European ancestry 122 . Copy number variants occur when identical or nearly identical 
sequences are repeated in some chromosomes but not others. The largest copy 
number variant identified in the Venter genome 11 was almost 2 Mb in length. 



Here we unify the exciting discoveries of these two dis- 
ciplines into a single Review to provide a comprehensive 
overview of our current knowledge of human genetic 
variation and where the key challenges lie for future 
research aimed at understanding the genetic architecture 
of complex traits. 

Classes of human genetic variation 

Human genetic variants are typically referred to as either 
common or rare, to denote the frequency of the minor 
allele in the human population. Common variants are 
synonymous with polymorphisms, defined as genetic 
variants with a minor allele frequency (MAF) of at least 
one percent in the population, whereas rare variants 
have a MAF of less than 1%. Genetic variants are also 
discussed in terms of their nucleotide composition. In 
the broadest sense, variants in the human genome can be 
divided into two different nucleotide composition classes: 
single nucleotide variants and structural variants 10 (FIG. 1 ). 
The vast majority of genetic variants are hypothesized 
to be neutral 21 (that is, they do not contribute to pheno- 
typic variation), achieving significant frequencies in the 
human population simply by chance. However, the rela- 
tive percentage of neutral, near-neutral 22 and non-neutral 
variants remains to be empirically determined. 
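
The frequency-based distinction above can be made concrete with a short sketch. It assumes a biallelic site and a simple list of observed alleles; the function names and the example data are illustrative and are not taken from any study cited here.

```python
from collections import Counter

# The 1% minor allele frequency cut-off stated in the text.
COMMON_MAF_THRESHOLD = 0.01


def minor_allele_frequency(alleles):
    """Return the frequency of the less common allele at a biallelic site.

    `alleles` is a list of allele calls, one per sampled chromosome.
    A site with only one observed allele has a MAF of 0.
    """
    counts = Counter(alleles)
    if len(counts) > 2:
        raise ValueError("expected a biallelic site")
    if len(counts) < 2:
        return 0.0
    return min(counts.values()) / len(alleles)


def classify_variant(maf):
    """Label a variant 'common' (MAF of at least 1%) or 'rare' (below 1%)."""
    return "common" if maf >= COMMON_MAF_THRESHOLD else "rare"


if __name__ == "__main__":
    # Hypothetical sample: 200 chromosomes, 3 of which carry the G allele.
    sample = ["A"] * 197 + ["G"] * 3
    maf = minor_allele_frequency(sample)
    print(maf, classify_variant(maf))
```

Note that the classification depends on the population sampled: the same variant can be common in one population and rare in another.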



Complex traits 

Continuously distributed 
phenotypes that are classically 
believed to result from the 
independent action of many 
genes, environmental factors 
and gene-by-environment 
interactions. 

Minor allele 

The less common allele of a 

polymorphism. 

Linkage disequilibrium 
(LD). In population genetics, 
LD is the nonrandom 
association of alleles. For 
example, alleles of SNPs that 
reside near one another on 
a chromosome often occur 
in nonrandom combinations 
owing to infrequent 
recombination. 



Single nucleotide variants. SNPs are the most prevalent 
class of genetic variation among individuals. On the basis 
of survey sequencing results it has been estimated that the 
human genome contains at least 11 million SNPs, with 
~7 million of these occurring with a MAF of over 5% 23 and 
the remainder having MAFs between 1 and 5%. Analysis 
of the four fully sequenced individual genomes suggests 



that these original estimates are fairly accurate and that 
most SNPs have been identified and information about 
them deposited in the Single Nucleotide Polymorphism 
database ( dbSNP ) (BOX 1 ). In addition to SNPs there are 
innumerable rare and novel or 'de novo' single nucleotide 
variants, in some cases segregating only in a nuclear 
family or a single individual. For instance, any base pair 
that, when altered, is compatible with life is likely to be 
found in at least one of the ~6.7 billion people on Earth. 
However, it is important to note that in any given indi- 
vidual the majority of variants are those that are com- 
mon in the population as a whole (BOX 1 ). Furthermore, 
when the genomes of two individuals are compared, 
the majority of the base pairs that differ are at positions 
with variants that are common in the population. 

The alleles of SNPs located in the same genomic inter- 
val are often correlated with one another. This correla- 
tion structure, or linkage disequilibrium (LD) 24 , varies in a 
complex and unpredictable manner across the genome 
and between different populations. The efforts of Phase I 
of the International HapMap Project 3 , along with those of 
Perlegen Sciences 5 , paved the way for breaking the 
genome down into groups of highly correlated SNPs 
that are generally inherited together (known as LD bins). 
From Phase II of the International HapMap Project 4 it 
was determined that the vast majority of SNPs with a 
MAF of at least 5% could be reduced to ~550,000 LD 
bins for individuals of European or Asian ancestry and 
to 1,100,000 LD bins for individuals of African ancestry 
(r² > 0.8). By genotyping the DNA sample of an individual 
with a 'tagging' SNP from each LD bin, knowledge 
regarding over 80% of SNPs present at a frequency above 
5% across the genome is gained 25-28 . 
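
As a rough illustration of how LD bins and tagging SNPs relate, the sketch below computes r² between pairs of SNPs from phased haplotypes and then groups SNPs greedily. It is a simplified stand-in for the binning procedures used in the cited projects, not a reproduction of them; the 0/1 allele coding, the function names and the toy haplotypes are all assumptions made for the example.

```python
def r_squared(hap_a, hap_b):
    """Squared correlation (r^2) between two biallelic SNPs.

    hap_a and hap_b hold the allele (0 or 1) carried at each SNP by the
    same set of phased haplotypes, in the same order.
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("need two equal-length, non-empty allele lists")
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # the disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denom == 0 else d * d / denom


def greedy_ld_bins(haplotypes, threshold=0.8):
    """Group SNPs into bins, each represented by one tagging SNP.

    Repeatedly picks the SNP that exceeds the r^2 threshold with the most
    not-yet-binned SNPs, makes it the tag, and removes its bin.
    `haplotypes` maps a SNP identifier to its list of 0/1 alleles.
    """
    remaining = set(haplotypes)
    bins = {}
    while remaining:
        best_tag, best_members = None, set()
        for snp in sorted(remaining):  # sorted for a deterministic result
            members = {
                other for other in remaining
                if other == snp
                or r_squared(haplotypes[snp], haplotypes[other]) > threshold
            }
            if len(members) > len(best_members):
                best_tag, best_members = snp, members
        bins[best_tag] = sorted(best_members)
        remaining -= best_members
    return bins


if __name__ == "__main__":
    # Toy data: snp1 and snp2 are perfectly correlated, snp3 is independent.
    toy = {
        "snp1": [0, 0, 0, 0, 1, 1, 1, 1],
        "snp2": [0, 0, 0, 0, 1, 1, 1, 1],
        "snp3": [0, 1, 0, 1, 0, 1, 0, 1],
    }
    print(greedy_ld_bins(toy))
```

Because LD differs between populations, the same set of SNPs generally yields more bins, and so requires more tagging SNPs, when the haplotypes come from individuals of African ancestry.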

Structural variants. Structural variation, broadly defined, 
refers to all base pairs that differ between individuals and 
that are not single nucleotide variants. Such variation 
includes insertion-deletions (indels), block substitutions, 
inversions of DNA sequences and copy number differences 
(FIG. 1 ). Compared with single nucleotide variants, the 
technological ability to detect structural variants in 
the human genome has only recently emerged 8,10,29-32 . 
Hence our understanding of the locations and frequencies 
of structural variants, and our ability to assay their asso- 
ciation with complex traits, is still maturing 33-38 . Analysis 
of the four fully sequenced human genomes (BOX 1 ) com- 
bined with targeted sequencing of structural variants 
greater than 8 kb in length in eight human genomes 9 has 
provided tremendous insight. These studies suggest that 
structural variation accounts for at least 20% of all genetic 
variants in humans and underlies greater than 70% of 
the variant bases. Altogether, for any given individual, 
structural variants constitute between 9 and 25 Mb of 
the genome (~0.5 to 1%), underscoring the important 
roles of this class of variation in genome evolution and 
in human health and disease. 

LD patterns of common structural variants 

There has been conflicting initial evidence regarding 
whether the alleles of structural polymorphisms are 
in LD with SNPs, and are therefore assayed by proxy 
Box 1 | Sequenced genomes provide insights into genetic variation 

Single nucleotide variants in four human genomes 

Genome                       (n)          In dbSNP (%) 
J. Craig Venter's genome     3,213,401    91.0 
James D. Watson's genome     3,322,093    81.7 
Asian genome                 3,074,097    86.4 
Yoruban genome               4,139,196    73.6 

Structural variants in the Venter genome 

Variant type                 (n)          Length (bp) 
Block substitutions          53,823       2-206 
Indels (heterozygous)        851,575      1-82,711 
Inversions                   90           7-670,345 
Copy number variants         62           8,855-1,925,949 


So far, four individual genomes have been fully sequenced. 
Two of these have been from Caucasian individuals 11,12 
(J. Craig Venter and James D. Watson); both are 
well-known scientists. The other two have been from 
anonymous individuals, one Han Chinese (Asian) 13 and one 
Nigerian (African) 14 . The genome of J. Craig Venter was 
sequenced using Sanger dideoxy technology, whereas the 
other three genomes were generated using newer DNA 
sequencing technologies that are characterized by shorter 
read lengths. Although single nucleotide variants are 
accurately detected in all four genomes the longer reads 
generated for the Venter genome allowed for better 
assembly and more accurate detection of structural 
variants (FIG. 1). 

The two Caucasian genomes have roughly similar 
numbers of single nucleotide variants (~3.3 million) with 
the vast majority of these sites previously identified as 
variants in the Single Nucleotide Polymorphism database 

(dbSNP; see the table). There are fewer novel single nucleotide variants in J. Craig Venter's genome owing to the fact that 
his genome was partially represented in the Celera human genome assembly 2 and variants in that assembly were 
subsequently mined and deposited into dbSNP 117 . The Asian genome has slightly fewer single nucleotide variants than the 
Caucasian genomes but approximately similar fractions are novel variants. The Yoruban genome has ~1.25-fold more 
single-base variants than the Caucasian genomes and a greater percentage is novel, which is reflective of the overall 
increased amount of diversity in genomes of individuals with African origins (BOX 2). The fact that in all four genomes the 
majority of single nucleotide variants are present in dbSNP suggests that most human high-frequency SNPs (minor allele 
frequency > 10%) have been discovered. 

Looking at the number of single nucleotide variants that are shared between the three 'out of Africa' genomes, 
~1.2 million (67%) are shared by all three, ~1.7 million (52%) are shared between any set of two genomes, and each has ~1.0 
million (30%) that are unique to their own genome". Overall, ~5.2 million single nucleotide variants were identified in the 
three genomes, the majority being present in dbSNP. As additional genomes are sequenced the number of SNPs present in 
humans will become more apparent, but at this time previous estimates of ~11 million are reasonable 23 . Interestingly, these 
data indicate that most single nucleotide variants present in an individual are common rather than rare. The corollary to this 
is that when two human genomes are compared the majority of the bases that differ will be due to common variants. 

On the basis of the Venter genome, Caucasians contain ~4.1 million genetic variants, of which ~22% are structural 
variants that account for 74% of all variant bases (see the table). This is likely to be an underestimate of the true 
contribution of structural variants to genetic diversity between individuals. In the Venter and Watson genomes, 
10 to 30 Mb of novel sequences that are not present in the reference genome assembly were generated. Furthermore, 
in the Asian and African genomes that were sequenced, more than half of the structural variants identified were not 
present in the reference genome. Interestingly, a study of ~1,300 structural variants in the 270 HapMap individuals 
showed that when two genomes are compared 92% of the bases that vary are accounted for by common structural 
variants 41 . In total these data provide two important insights into structural variants: the majority of common 
structural variants are yet to be discovered; and common structural variants constitute the vast majority of base pairs 
that differ between any two individuals. 
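
The headline figures quoted for the Venter genome can be checked against the counts in the box table. The sketch below simply sums the tabulated counts; it assumes, as the text suggests, that the ~4.1 million total is the sum of the single nucleotide variants and the four structural classes listed (with only heterozygous indels counted, as in the table). The variable and function names are illustrative.

```python
# Counts copied from the Box 1 table for the Venter genome.
VENTER_SINGLE_NUCLEOTIDE_VARIANTS = 3_213_401
VENTER_STRUCTURAL_VARIANTS = {
    "block substitutions": 53_823,
    "indels (heterozygous)": 851_575,
    "inversions": 90,
    "copy number variants": 62,
}


def structural_share(snv_count, structural_counts):
    """Return (total variants, fraction of them that are structural)."""
    structural_total = sum(structural_counts.values())
    total = snv_count + structural_total
    return total, structural_total / total


if __name__ == "__main__":
    total, share = structural_share(
        VENTER_SINGLE_NUCLEOTIDE_VARIANTS, VENTER_STRUCTURAL_VARIANTS
    )
    print(f"{total:,} variants in total, {share:.1%} of them structural")
```

Indels dominate the structural count, whereas the share of variant bases is driven by the much longer inversions and copy number variants; the table gives only length ranges, so the 74% figure cannot be recomputed from it.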



using tagging SNPs. Several studies have demonstrated 
that common short indels (1-5 bp) 33,39,40 , as well as 
larger common structural polymorphisms in unique 
regions 8,41 of the genome, are in LD with tagging SNPs. 
Except for a potential skew towards lower MAFs, structural 
variants seem to behave similarly to SNPs in terms 
of both genomic and population distribution, indicat- 
ing a similar evolutionary history: both types of vari- 
ants are 'ancestral', having arisen once in human history 
and shared among individuals by descent rather than 
occurring as the result of recurrent mutations 39,41,42 . 

The evolutionary history and LD pattern of struc- 
tural polymorphisms in segmental duplications has been 
more difficult to determine. Segmental duplications are 
composed of repeated sequences over 5 kb in length with 
>90% sequence identity 43 . Structural polymorphisms 
are highly enriched in regions of the genome that have 
recently undergone duplication. Indeed, 25 to 50% of all 



nucleotides in large structural variants map in segmen- 
tal duplications, which constitute only 5.3% of genomic 
sequences 29 . This strong relationship between structural 
variants and segmental duplications is reflective of their 
similar natures. Recent data suggest that up to 25% of 
the intervals annotated as segmental duplications in the 
reference human genome sequence actually represent 
copy number variants between individuals rather than 
fixed duplication events 11 . Structural polymorphisms 
in segmental duplications exhibit low LD with tagging 
SNPs 44 . Recent studies indicate that this observed lower 
LD is the result of a paucity of validated SNPs 41 that 
can potentially serve as tags in segmental duplications 
compared with the rest of the genome. These studies 
indicate that common structural variants in segmental 
duplications share a similar evolutionary history with 
those in unique regions of the genome and are in LD 
with neighbouring SNPs. 
Population stratification 

Subdivision of a population 
into different ethnic groups 
with potentially different 
marker allele frequencies and 
different disease prevalences. 



Contribution of variants to phenotypes 

In humans, hundreds of complex phenotypic traits deter- 
mine how we look and behave, and our propensity to 
develop certain diseases. Each complex phenotype is 
governed by a combination of inherited factors, which 
are largely believed to be genetic variants, and environ- 
mental influences. Full sequencing of human genomes 
has shown that in any given individual there are, on average, 
~4 million genetic variants encompassing ~12 Mb of 
sequence (BOX 1 ). The challenge is to determine which 
of these variants underlies or is responsible for the inher- 
ited components of phenotypes. Over the last decade or 
so the human genetics field has debated 45,46 the common 
disease-common variant hypothesis, which posits that 
common complex traits are largely due to common vari- 
ants with small to modest effect sizes 47 *. The opposing 
theory, the rare variant hypothesis, posits that common 
complex traits are the summation of low-frequency, 
high-penetrance variants 50,51 . Overall the field is making 
earnest attempts to determine the relative importance 
of common and rare variants in common complex 
phenotypic traits. 

GWA studies 

Linking common genetic variants to common complex 
traits. Concurrent with the efforts of the scientific com- 
munity to dissect the human genome into LD blocks were 
extraordinary technological advances in assaying SNPs. 
From 1997 to 2007, technological advances moved the 
field from testing one SNP at a time to the assessment of a 
million SNPs per individual. These two fronts of progress 
(one on the empirical determination of the LD structure 
of SNPs across the genome 3-5 and the other a new-found 
capacity to perform ultra-high-throughput genotyping) 
set the foundation for a veritable avalanche of discoveries 
of common genetic variants associated with various 
common traits and diseases through GWA studies. 

There are several excellent reviews of GWA study 
designs and analysis that discuss selection of cases and 
controls, and statistical analyses — including dealing 
with population stratification and replication 15,52,53 . As 
these topics have been previously covered we do not 
discuss them here. Rather, we focus on the important 
insights about the genetic basis of human complex traits 
gained from GWA studies, as we believe these findings 
have immediate relevance to basic scientists as well as 
the medical research community. We also discuss the 
limitations of the GWA approach in identifying genetic 
variants underlying complex traits, providing insights 
into the key lines of experimentation for the future. 

GWA studies published to date have used various 
commercial genotyping platforms containing approxi- 
mately 300,000 to 500,000 common SNPs to detect 
differences in allele frequencies between cases and controls 25-28 . 
Such studies are hypothesis-free, as there is no 
bias or presumptive list of candidate genes that are being 
tested 54 . However, the term 'genome-wide' is a misnomer, 
because approximately 20% of common SNPs are only 
partially tagged or not tagged at all, and rare variants are 
generally not tagged. For over 80 phenotypes — includ- 
ing diseases and biological measurements — GWA 



studies have provided remarkably compelling statisti- 
cal associations for a total of over 300 different loci in 
the human genome 55 . The results have been reported on 
almost a weekly basis from April 2007, with over 220 
studies reported to date. Almost all disease categories 
have been addressed, including cardiovascular, neu- 
rodegenerative, neuropsychiatric, metabolic, autoim- 
mune and musculoskeletal diseases, and several types 
of cancer. 

Enhanced understanding of human diseases. The most 
impressive outcome of this knowledge base, which con- 
nects genomic intervals with complex traits, is a new 
understanding of the molecular underpinnings and 
pathways of many diseases 56 . Notably, most of the genes 
or genomic loci that have been identified through GWA 
studies have not previously been known to be related to 
the complex trait under investigation. For a substantial 
number of common diseases the newly identified path- 
ways suggest that molecular subphenotypes may exist; 
that is, although a number of different pathways might 
potentially be involved in the development of a particular 
disease when all cases are considered, in any individual 
with the disease only one or a subset of these pathways 
might be involved. For example, the genetic propensity 
to develop type 2 diabetes (T2D) seems to involve genes 
in several different pathways that affect pancreatic β-cell 
formation and function, as well as pathways affecting 
fasting glucose levels and obesity 57-59 (FIG. 2). Likewise, 
many of the loci associated with multiple sclerosis 
involve immune function — including the interleukin 
receptor genes IL2RA and IL7RA, and the HLA-DRA 
locus — but a gene encoding a protein involved in 
axonal function, kinesin family member 1B (KIF1B), 
is also associated with this disease 60,61 . Clinicians previ- 
ously considered these conditions as simple phenotypes, 
with all patients with the diagnosis having the same 
underlying biological disorder. 

Surprisingly, there have been several instances in 
which one genomic interval has been associated with 
two or more seemingly distinct diseases. This conver- 
gence of genes associated with multiple diseases has led 
to the concept of the 'diseaseome' 62 , which maps a net- 
work of how different genes and pathways connect to 
various diseases (FIG. 3). Examples include different inter- 
leukin receptor genes that are associated with Crohn's 
disease, multiple sclerosis, systemic lupus erythema- 
tosus and rheumatoid arthritis 56,63 . Such diseases had 
already been thought of as sharing a common immune- 
mediated aetiology, but now there is discrete evidence for 
a common genetic underpinning. Another example is 
the common SNP on chromosome 9p21 that is associated 
with three vascular phenotypes: myocardial infarction 64-66 , 
abdominal aortic aneurysm and intracranial 
aneurysm 67 . Such conditions would not previously have 
been thought to have a common pathogenic thread. The 
recent exceptional advances in associating genes with 
many diseases have led some to suggest that the textbooks 
of medicine need to be rewritten to account for our 
enhanced understanding of the interconnectivity of the 
molecular basis underlying distinct diseases. 
[Figure 2 diagram. Gene labels recovered: KCNJ11, ADAMTS9, SLC30A8, TCF7L2, CAMK1D, TSPAN8. Legend categories: β-cell formation; β-cell function, insulin secretion; fasting glucose levels; body mass index; adipocyte differentiation; insulin pathway regulation; unknown role in T2D.]


Figure 2 | Insights into the genetic basis of type 2 diabetes (T2D). Genome-wide association (GWA) studies have 
identified 18 genomic intervals that confer increased risk to T2D in Caucasians 58,59 " -75123-127 . Four of these contain 
previously known candidate genes, based on the involvement of rare mutations in monogenic forms of diabetes. 
However, the remaining 14 intervals contain genes that were previously unsuspected in playing a part in the genetic 
basis of T2D. Traditional risk factors for T2D include obesity (defined as increased body mass index), elevated fasting 
glucose levels and impaired β-cell function, which results in reduced insulin secretion 57 . GWA studies have revealed 
new loci that are associated with these phenotypes in T2D cases. For example, the association of melatonin receptor 1B 
(MTNR1B) shows the involvement of the circadian rhythm pathway in fasting glucose levels and T2D 56M,128 . The 
functions of the genes suspected of playing a part in β-cell dysfunction are diverse, with functions that include 
pancreatic islet proliferation, insulin secretion and cell signalling. Six additional genes contain variants that are 
statistically associated with T2D, but their role in the disorder has not yet been elucidated. The functional diversity of T2D 
genes and the multitude of pathways in which they are members were not imagined before the results of the GWA 
studies. Note that although insulin-like growth factor 2 mRNA binding protein 2 (IGF2BP2) is known to regulate insulin 
signalling through its binding to insulin-like growth factor 2 (IGF2), there are no data indicating its role in diabetes. 
ADAMTS9, ADAM metallopeptidase with thrombospondin type 1 motif, 9; CAMK1D, calcium/calmodulin-dependent 
protein kinase ID; CDKAL1, CDK5 regulatory subunit associated protein 1-like 1; CDKN, cyclin-dependent kinase 
inhibitor; FTO, fat mass and obesity associated; HHEX, hematopoietically expressed homeobox; HNF1B, HNF1 
homeobox B (also known as TCF2); IDE, insulin-degrading enzyme; JAZF1, JAZF zinc finger 1; KCNJ11, potassium 
inwardly rectifying channel, subfamily J, member 11; NOTCH2, Notch homologue 2; PPARG, peroxisome proliferator- 
activated receptor gamma; SLC30A8, solute carrier family 30 (zinc transporter), member 8; TCF7L2, transcription factor 7-like 2 
(T-cell specific, HMG-box); THADA, thyroid adenoma associated; TSPAN8, tetraspanin 8; WFS1, Wolfram syndrome 1. 



Limitations of GWA studies in identifying causative vari- 
ants. Despite this exceptional progress, there are substan- 
tial limitations to the GWA study approach. Although 
statistically compelling associations have been identified, 
there is an enormous gap in the ability to provide the 
biological explanation for why a genomic interval tracks 
with a complex trait. For the most part, all we know is 
that a tag SNP for an LD bin is statistically associated with 
a trait, but we have no idea of the precise variants in the 
bin that have a causal role in contributing to variation in 
the trait. It is important to emphasize that tag SNPs are in 
LD not only with other SNPs but also with common struc- 
tural variants, the majority of which have not yet been 
identified. The best way to move from a statistical asso- 
ciation to knowledge of the causative variant is unclear. 
In some cases it will be straightforward to identify causative 
variants that are in LD with a tagging SNP and that 
are located in exons that truncate or otherwise alter the 
gene product. However, the causative variants underlying 
GWA study associations are likely to be regulatory rather 
than coding. For instance, many of the associations so far 
are not even localized to intervals that include a gene. For 
example, the variant at 9p21 that associates with myocardial 
infarction is 150 kb from the nearest gene 64-66 , and for 
the variants on 8q24 that are associated with susceptibility 
to multiple solid tumours this distance is 300 kb 68,69 . 



Experiments are being conducted that simultaneously 
assay global gene expression and genome-wide variation 
in a large number of individuals to map genetic factors 
underlying differences in expression levels 70 . These data 
sets may be valuable tools for identifying the causative 
variants and biological bases for many loci associated 
with a complex trait through GWA studies. 

Transferring GWA study results to other populations. 

With rare exceptions, the GWA studies carried out so 
far have focused on populations of European ancestry 
for the primary, high-throughput genotyping and have 
only interrogated other ancestries using limited replica- 
tion genotyping. Unless a particular functional variant 
has been unambiguously identified, testing a tag SNP 
that is associated with a disease or trait in one popula- 
tion for risk assessment in an individual from another 
population can be problematic. The problem stems from 
both allele frequency differences between populations 71 
and the fact that LD patterns across loci that mark or co- 
segregate with a putative causally associated genetic variant 
may be different from population to population. 

For instance, the tagging SNP rs10757278, which is 
found on chromosome 9p21 and is associated with myocardial 
infarction in Caucasians 64-66 , is in strong LD with 
multiple SNPs in this population (BOX 2); however, in 
[Figure 3 panel label: Metabolic diseases and cancer] 
Figure 3 | Overlap of genetic risk factor loci for common diseases. A surprising finding of genome-wide association 
(GWA) studies is that over 15 loci are associated with the risk of developing two or more diseases; eight are shown here 
for illustrative purposes. Some alleles may be protective for one disease but confer susceptibility to another; for 
example, the SNP R602W in PTPN22 (protein tyrosine phosphatase, non-receptor type 22 (lymphoid)) protects against 
Crohn's disease but predisposes to several other autoimmune diseases 128-130 . For some genes, distinct risk alleles are 
associated with different diseases; for example, JAZF zinc finger 1 (JAZF1) has a role in prostate cancer and in type 2 
diabetes 75,131,132 . Thus, GWA study results indicate that many diseases that were previously viewed as having distinct 
aetiologies probably share common molecular causes. In some cases, the 'sharing' of associated genetic variants 
across diseases may be expected owing to shared clinical features of these disorders; for example, among the 
autoimmune diseases. In other cases this sharing is more surprising; for example, the involvement of glucokinase 
regulatory protein (GCKR) in both triglyceride levels and autoimmune diseases. CDKAL1, CDK5 regulatory subunit 
associated protein 1-like 1; HNF1B, HNF1 homeobox B (also known as TCF2); IL23R, interleukin 23 receptor; STAT4, 
signal transducer and activator of transcription 4; TCF7L2, transcription factor 7-like 2 (T-cell specific, HMG-box). 



Odds ratio 

A measurement of association 
that is commonly used in 
case-control studies. It is 
defined as the odds of exposure 
to the susceptible genetic 
variant in cases compared with 
that in controls. If the odds ratio 
is significantly greater than one, 
then the genetic variant is 
associated with the disease. 



Asians this SNP is in a singleton block, and in Africans 
it is in LD with only a subset of the same SNPs present 
in Caucasians. Thus, rs10757278 probably tags so far 
undiscovered variants differently in the three populations. 
By contrast, the tagging SNP rs13266634 on 8q24, 
which has been associated with T2D in Caucasians 7 " 5 , 
is in LD with the same set of SNPs in all three popula- 
tions, suggesting that it may tag so far undiscovered vari- 
ants similarly in the three populations. Interestingly, the 
structure of the LD bin is similar but the frequency of 
the variants is different in the populations. Thus, 
although panels of markers that capture as much vari- 
ation as possible across the genome have been devised 
to facilitate association studies in different populations 25-28,76 , 
markers that are found to be associated with 
a particular trait or disease in any given population will 
often not be transferable for risk prediction in individuals 
from a different population. 

GWA studies of structural variants. The fact that struc- 
tural variants underlie greater than 70% of the bases that 
vary in humans suggests that they will play a profound 
part in phenotypic diversity between individuals. Thus, 
there is tremendous interest among many researchers 
to test structural variants for association with specific 
complex traits. It is important that association studies 
involving structural variants are subjected to the same 
standards of quality control and replication that have 
been developed for SNP-based studies 77 . 

Interestingly, recent studies that have looked for 
associations between rare structural variants and autism 



and schizophrenia have identified specific deletions 
involved in both of these diseases. Notable among these 
is the association between rare recurrent deletions and 
duplications of a 600 kb interval at 16p11.2 that was 
observed in multiple unrelated individuals with autism 
and was estimated to account for 1% of the cases 78-80 . 
Additionally, large deletions (>3 Mb) on chromosomes 
22q11.2, 1q21.1 and 15q13.3, each with high estimated 
odds ratios (>17), show significant association with 
schizophrenia 81,82 . In contrast to these associations 
with specific structural variants, several studies have 
presented evidence that individuals with schizophrenia 
have a slightly increased (1.15-fold) overall load of large 
(>100 kb) structural variants in their genomes compared 
with control individuals 82,83 . It is currently unclear what 
this slight increase in rare genome-wide structural vari- 
ants in schizophrenia patients compared with controls 
means for the aetiology of this disease. First, normal indi- 
viduals harbour many large structural variants (>8 kb). 
Second, the current framework for understanding the 
inherited basis of phenotypic traits is that specific genetic 
loci will be associated. Further studies examining larger 
sample numbers will hopefully provide insights into the 
mechanisms underlying these associations. 

Beyond current GWA studies 

An unforeseen limitation of GWA studies is that the 
genomic markers that are found to be associated with 
any given complex trait each have less impact on sus- 
ceptibility than was anticipated. The small magnitude 
of susceptibility risk (or protection from the condition of 
Box 2 | The LD of common variants in the human genome differs between populations 

The linkage disequilibrium (LD) structure of SNPs in a 13 kb interval of chromosome 9p21 is shown for the three HapMap 
populations: CEU (European ancestry), JPT+CHB (Asian ancestry) and YRI (African ancestry) (see the figure). There are 
two commonly used definitions of LD, D' and r², that capture different aspects of nonrandom association. On the left of 
the figure, the LD structures of the interval are shown quantified using D' 21 . SNPs that were ascertained in Phase II of the 
HapMap Project with a minor allele frequency (MAF) > 5% are shown in their respective map positions. There are 
population differences in the numbers of SNPs that meet the > 5% criterion. The pairwise correlation of SNPs — which 
are shown as vertical lines — is shown as red and white boxes, with red indicating high correlation (D' = 1) and white 
indicating no correlation (D' = 0) (other colours represent intermediate values). On the right of the figure, all SNPs are 
shown on the bottom row as black triangles. Above this, SNPs are grouped together into bins at an r² > 0.8 (using the 
ldSelect algorithm) 118 . SNPs that are efficiently tagged by each other (r² > 0.8) are shown in the same colour and are 
connected by a line. Singleton bins that do not tag any other SNPs are shown as individual blue triangles. 

Using both r² and D' it is clear that LD is weaker in the African population than in the Caucasian and Asian populations. 
This reflects the fact that some haplotype patterns across the genome were lost in population bottlenecks associated with 
human migration out of Africa. Using the D' statistic fewer SNPs are correlated in Africans, as indicated by a lower number of red boxes 
and more white boxes. Using the r² statistic Africans have a greater number of singleton SNPs, and the LD bins contain fewer 
variants and span shorter lengths. The r² statistic shows greater differences in pairwise correlations between 
SNPs in the populations, which is due to the fact that allele frequencies vary substantially between the three groups. For 
example, in Caucasians, SNP rs10757278 (purple box), which has been associated with myocardial infarction 64-66 , lies in a 
bin composed of eleven SNPs; in Africans it is in LD with three of these SNPs, and in Asians it is in a singleton LD bin. For 
rs10757278 the MAF is the same in Caucasians and Asians (50%) but considerably less in Africans (5%). 
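
The two LD statistics discussed in this box can be illustrated with a short calculation. The following Python sketch is not taken from the HapMap or ldSelect software; the function name and the example frequencies are hypothetical, and it assumes two biallelic SNPs that are both polymorphic, with known haplotype and allele frequencies.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A and allele B (both strictly between 0 and 1)
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient
    # D' rescales |D| by the largest value it could take given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    # r^2 is the squared correlation between the two allele indicators
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Hypothetical example: allele B (5%) occurs only on haplotypes carrying allele A (50%)
d, d_prime, r2 = ld_stats(p_ab=0.05, p_a=0.5, p_b=0.05)
```

In this hypothetical case D' equals 1 because no recombinant haplotype is observed, yet r² is small because the two allele frequencies differ, which is one reason the two statistics can give different pictures of the same interval.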



interest) for each genomic marker needs to be empha- 
sized (BOX 3). Most of the odds ratios for the heterozy- 
gote genotypes of the associated variants that have been 
identified so far are approximately 1.1, a figure that can 
increase to 1.5-1.6 for homozygote genotypes. Fewer 
than 12 common genetic variants (excluding those in 
the human leukocyte antigen (HLA) region) have high 



odds ratios between 2 and 10 for association with a particular 
trait 84-93 . Therefore, only a limited amount of the 
genetic variance underlying the heritable component of 
any of the ~80 complex traits that have been examined 
has been identified (BOX 3). Even for disease traits for 
which a large number of common genetic variants have 
been identified, only a small fraction of the inherited risk 
Epistasis 

In statistical genetics, this term 
refers to an interaction of 
multiple genetic variants 
(usually at different loci) such 
that the net phenotypic effect 
of carrying more than one 
variant is different than would 
be predicted by simply 
combining the effects of each 
individual variant. 



has been explained. For example, in the case of Crohn's 
disease, although over 30 associated genomic markers 
have been validated, these account for less than 10% of 
the cumulative genetic variance 87 , and the 44 loci associated 
with height account for ~5% 94-99 . Currently, there 
are almost no complex traits for which there is much 
greater than 10% of the genetic variance explained, leav- 
ing the bulk of heritability unexplained by the common 
variants identified so far. Thus, we are left wondering: 
where is the rest of the genetic variation underlying 
these heritable traits, and how do we capture it? 
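
The odds ratios quoted above are computed from case-control counts. As a minimal sketch, assuming a simple 2x2 table of allele counts and the standard log-scale (Woolf) confidence interval, the calculation can be written as follows; the function name and the counts are hypothetical and are not taken from any study cited here.

```python
import math


def odds_ratio_ci(case_exposed, case_unexposed, ctrl_exposed, ctrl_unexposed, z=1.96):
    """Odds ratio and approximate 95% confidence interval from a 2x2 table.

    'Exposed' means carrying the risk allele. All four counts must be > 0.
    """
    odds_ratio = (case_exposed * ctrl_unexposed) / (case_unexposed * ctrl_exposed)
    # Standard error of log(OR) under the usual large-sample approximation
    se_log_or = math.sqrt(
        1 / case_exposed + 1 / case_unexposed + 1 / ctrl_exposed + 1 / ctrl_unexposed
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical allele counts: 2,000 cases and 2,000 controls (4,000 alleles each)
or_value, ci_low, ci_high = odds_ratio_ci(1100, 2900, 1000, 3000)
```

An interval that excludes 1 corresponds to nominal significance only; genome-wide studies apply far stricter thresholds because of the number of markers tested.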

One possibility is that the missing variation is 
accounted for by common genetic variants with small 
effect sizes that have not yet been identified. Many GWA 
studies have been conducted using sample sizes of 2,000 to 
5,000 individuals and have sufficient statistical power 
to confidently identify common variants with odds ratios 
of 1.5 or greater 18 . Therefore, it is likely that only a few, if 
any, common variants with moderate to large effect sizes 
remain to be discovered for most complex traits investi- 
gated to date. Sample sizes of 60,000 are required to pro- 
vide sufficient power to identify the majority of variants 
with odds ratios of 1.1 (REF. 18). This raises the possibility 
that when higher-powered GWA studies are ultimately 
performed the number of common variants with small 
effect sizes — but unequivocal statistical association — 
may substantially increase. It is important to note that 
GWA studies and meta-analyses of combined GWA stud- 
ies have been conducted using 20,000 to 40,000 samples 
for lipid phenotypes (low- and high-density lipoprotein, 
total cholesterol and triglyceride levels) 100-102 but still only 
a small proportion of trait variance (5 to 10%) has been 
identified, leaving much of the heritability of these traits 
unexplained. Similarly, a meta-analysis of GWA studies 
for height that used an effective sample size of 27,000 
(REFS 98,99) found less than 5% of the genetic variance 
underlying the trait. It is unknown how much more of 
the genetic variance underlying complex traits will be 
accounted for by increasing sample sizes in GWA stud- 
ies. However, given the results of the studies conducted so 
far using large numbers of individuals it is reasonable to 
expect that GWA studies using 60,000 to 100,000 individ- 
uals will probably capture at most 10-15% of the genetic 
variance underlying any given phenotype. 
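
The dependence of statistical power on sample size and odds ratio described above can be sketched with a normal approximation to an allelic case-control test. This is an illustrative calculation, not the method used in REF. 18: the function name, the significance threshold of 5e-8 and the control allele frequency of 0.3 are assumptions introduced here.

```python
from statistics import NormalDist


def allelic_test_power(odds_ratio, control_freq, n_cases, n_controls, alpha=5e-8):
    """Approximate power of a two-sided allelic test (normal approximation).

    control_freq: risk-allele frequency in controls
    alpha: per-marker significance threshold (assumed genome-wide level)
    """
    norm = NormalDist()
    # Risk-allele frequency in cases implied by the allelic odds ratio
    case_odds = odds_ratio * control_freq / (1 - control_freq)
    case_freq = case_odds / (1 + case_odds)
    # Each individual contributes two alleles
    variance = (
        case_freq * (1 - case_freq) / (2 * n_cases)
        + control_freq * (1 - control_freq) / (2 * n_controls)
    )
    z_effect = abs(case_freq - control_freq) / variance ** 0.5
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # The negligible probability of rejecting in the wrong direction is ignored
    return norm.cdf(z_effect - z_crit)


# Hypothetical designs: 5,000 versus 60,000 individuals, split equally
power_or15_small = allelic_test_power(1.5, 0.3, 2500, 2500)
power_or11_small = allelic_test_power(1.1, 0.3, 2500, 2500)
power_or11_large = allelic_test_power(1.1, 0.3, 30000, 30000)
```

Under these assumptions a study of 5,000 individuals has high power for an odds ratio of 1.5 but very little for 1.1, whereas a study of 60,000 individuals has high power for 1.1, in line with the figures quoted in the text.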



Some of the missing heritability is likely to be 
accounted for by rare and novel variants (BOX 4). Because 
rare variants are not in LD with tagging SNPs the GWA 
studies currently being conducted are not able to capture 
their contribution to complex traits. The sequencing of 
candidate genes involved in the development of colorectal 
adenomas 103 and in the metabolism of lipids 104-106 
and folate 107 suggests that rare variants with moderate to 
high penetrances contribute to the genetic components 
of common complex traits. To investigate the role of 
rare variants in complex traits, a technological advance 
enabling rare variants to be directly assayed is required. 
This breakthrough will probably come from new DNA 
sequencing technologies, which may, in the next 3 to 
5 years, be capable of generating genomic sequences of 
thousands of individuals in a cost-effective way. The 
methods being proposed to analyse such future data sets 
to identify rare variants associated with complex traits 
are in theory straightforward (BOX 4), but they will be 
complex to implement as they largely rely on looking for 
frequency differences of functional rare variants in cases 
versus controls. Similar to the identification of functional 
common variants, the alleles of rare variants that cluster 
in coding sequences will be easy to find but the meth- 
ods for efficiently annotating variants in other functional 
elements are not yet on the horizon. To develop such 
methods it is necessary to first identify and catalogue 
all functional elements in the human genome 108 , and 
then determine which nucleotides in these elements, if 
altered, would have functional consequences. 

Finally, there are statistical limitations of the GWA 
approach in identifying gene-gene and gene-environment 
interactions, which are likely to be profoundly important. 
The effects of genetic background on the impact that a 
particular genetic variant has on a phenotype are well 
documented in the model organism literature. For exam- 
ple, it is well known that the results of knockout, trans- 
gene, chromosome substitution and QTL mapping studies 
in mice are all influenced by the choice of strain 109-113 . 
Initial attempts to identify epistasis in GWA data have 
been unfruitful 114-116 and it is unclear how to proceed. 
Conceptual advances in our understanding of the mech- 
anisms underlying gene-gene and gene-environment 
interactions are required before we can accurately model 
and measure their effects in complex traits in humans. 
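
One simple way to express a gene-gene interaction is as a departure from a multiplicative model of odds ratios. The sketch below assumes that 'combining the effects' means multiplying single-variant odds ratios, which is only one possible choice of scale; the function names and counts are hypothetical.

```python
def odds(cases, controls):
    """Odds of being a case within one genotype group."""
    return cases / controls


def interaction_ratio(counts):
    """Ratio of the observed joint odds ratio to the multiplicative expectation.

    counts maps (carries_a, carries_b) -> (cases, controls).
    A value of 1.0 means no departure from the multiplicative model.
    """
    reference = odds(*counts[(False, False)])
    or_a = odds(*counts[(True, False)]) / reference
    or_b = odds(*counts[(False, True)]) / reference
    or_both = odds(*counts[(True, True)]) / reference
    return or_both / (or_a * or_b)


# Hypothetical counts in which carriers of both variants have twice the expected odds
example_counts = {
    (False, False): (100, 100),
    (True, False): (150, 100),
    (False, True): (200, 100),
    (True, True): (600, 100),
}
ratio = interaction_ratio(example_counts)
```

In a genome-wide setting the number of variant pairs is very large, so the joint-carrier groups are often small and the multiple-testing burden is heavy, which is part of the statistical limitation noted above.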



Box 3 | How many genetic variants do we expect to find for complex traits? 

The 18 genetic variants that have been associated with type 2 diabetes (FIG. 2) have 
minor allele frequencies (MAFs) ranging from 0.073 to 0.50 and odds ratios (ORs) 
ranging from 1.05 to 1.15, except for the TCF7L2 gene, which has an OR of 1.37. These 
MAFs and ORs are typical of what is observed for the genetic variants discovered in 
genome-wide association (GWA) studies for other diseases and complex phenotypic 
traits. Altogether, these 18 variants explain less than 4% of the total liability of the trait, 
which is only a small fraction of the estimated heritability. This implies that there are 
many more genes to be identified that contribute to the genetic components of the 
disease. Assuming that the undiscovered genetic variants have similar MAFs and ORs 
as those that have been identified, and estimating 40% heritability, more than 800 
genetic variants are required (Y. Pawitan, personal communication). If we assume that 
the undiscovered genetic variants are largely rare (BOX 4) with MAFs that are ~10 times 
smaller than those identified to date (0.0073 to 0.05) and ORs that are ~10 times larger 
(1.63 to 4.05), then ~85 variants are required (Y. Pawitan, personal communication). 
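
The reasoning in this box can be illustrated with a rough calculation. The sketch below is not the method behind the figures attributed to Y. Pawitan. It assumes a logistic liability model, Hardy-Weinberg proportions, additive effects on the log-odds scale, and independent variants that all share one MAF and one OR; the function names and example values are hypothetical.

```python
import math

# Variance of the standard logistic distribution, used as the non-genetic variance
LOGISTIC_RESIDUAL_VAR = math.pi ** 2 / 3


def variance_per_variant(maf, odds_ratio):
    """Log-odds variance contributed by one additive biallelic variant."""
    beta = math.log(odds_ratio)
    return 2 * maf * (1 - maf) * beta ** 2


def variants_needed(maf, odds_ratio, heritability):
    """Number of identical, independent variants needed to reach a given heritability."""
    # Genetic variance such that genetic / (genetic + residual) equals the heritability
    genetic_var = heritability / (1 - heritability) * LOGISTIC_RESIDUAL_VAR
    return math.ceil(genetic_var / variance_per_variant(maf, odds_ratio))


# Hypothetical scenarios: common weak variants versus rarer, stronger variants
n_common = variants_needed(maf=0.2, odds_ratio=1.1, heritability=0.4)
n_rare = variants_needed(maf=0.02, odds_ratio=1.1 ** 10, heritability=0.4)
```

Note that raising an OR to the tenth power multiplies its log-odds effect by ten, which matches the way the '~10 times larger' ORs in the text relate to the observed range (1.05 and 1.15 become 1.63 and 4.05).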



Summary and conclusions 

Genetic architecture of complex traits. During the past 
few years there have been tremendous advances in our 
knowledge of genome-wide LD patterns of SNPs, the rel- 
ative contributions of single nucleotide versus structural 
variants in overall genetic diversity, and the range of effect 
sizes for common variants. In spite of these advances we 
have a limited understanding of the genetic architecture 
of complex traits, including the number of genetic vari- 
ants that influence any one trait, their allele frequencies, 
effect sizes and modes of interactions. The results of 
GWA studies over the past 2 years have cast doubt over 
the validity of the common disease-common variant 
hypothesis. This is largely because the low odds ratios of 
common single nucleotide variants (BOX 3), even assuming 



248 | APRIL 2009 | VOLUME 10 



www.nature.com/reviews/genetics 



© 2009 Macmillan Publishers Limited. All rights reserved 



REVIEWS 



additive penetrances, preclude them from being responsible 
for the familial clustering of most complex traits. 
What about common structural variants? Based on our 
current understanding the majority of common structural 
variants should be in LD with SNPs and thus may 
have already been assayed by proxy in GWA studies 18,41 . 
Although a major challenge with current technologies, it 
is a priority to catalogue the locations and frequencies of 
common structural variants and empirically determine 
their LD patterns across the genome. Only after this is 
accomplished will we have a firm understanding of how 
well common structural variants are assayed in GWA 
studies based on tagging SNPs. 

Another pressing issue is the importance of rare and 
novel variants in the familial aggregation of complex 
traits. On the basis of the analysis of the Venter genome, 
Caucasians are likely to carry 200 to 500 non-synonymous 
rare (MAF < 5%) and/or novel single nucleotide variants 
that affect protein function 117 . Gene-centric, genome-wide 
rare variant sequencing programmes are underway, and 
therefore the extent to which these variants contribute 
to the familial aggregation of any given complex trait 
should be determined in the next 3 to 4 years. However, 
we will not be able to fully address the role of rare variants, 
including non-coding and structural variants, until rapid 
cost-effective methods for sequencing entire genomes are 
available. Given that the new DNA sequencing technologies 
have difficulties in accurately identifying and characterizing 
structural variants, it is hard to predict when this 
will become technically feasible. 

Over the next 5 to 10 years, systematic exploration of 
the universe of variants and epistasis, and of epigenomics, 
will undoubtedly provide tremendous insights into the 
genetic architecture of complex traits. With our current 

Box 4 | Linking rare genetic variants to common complex traits 

Much remains to be determined about the relative contribution of rare variants to common complex traits, but the 
findings of several studies to date provide some insights. Although the rigid definition of a rare variant is one present 
with a minor allele frequency (MAF) of less than 1%, the frequency boundaries used in the literature vary. Here, we 
define variants with MAFs between 0.1% and 3% as rare variants and MAFs of less than 0.1% as novel (for 
high-frequency common variants that are in linkage disequilibrium (LD) with one another the MAF is greater than 5%). 

Rare variants identified in GWA studies 

In general, rare variants are not in LD with common variants and therefore will not be detected in genome-wide 
association (GWA) studies 51 . However, when the cohorts in a GWA study have a substantial number of individuals that 
share distant ancestors, 10 to 20 generations ago, it is sometimes possible to identify rare, highly penetrant variants. In 
a recent GWA study 119 involving 809 Old Order Amish individuals, an SNP on chromosome 11q23 with a minor allele 
frequency (MAF) of 0.028 was associated with markedly lower fasting serum triglyceride levels, higher levels of 
high-density lipoprotein (HDL) cholesterol and lower levels of low-density lipoprotein (LDL) cholesterol. With much 
study and a keen understanding of lipid metabolism the investigators demonstrated that the associated SNP tags a 
loss of function variant located 823 kb away in apolipoprotein C3 (APOC3). Consistent with a favourable lipid profile, 
carriers of this variant (APOC3 R19X) are significantly less likely than non-carriers to have coronary artery 
calcification. Importantly, the origin of the APOC3 R19X variant was shown to be from a founding couple of the 
Lancaster Amish born in the early 1800s, with all carriers being descended from this couple. In a second GWA study 120 
involving 4,763 individuals in the Northern Finland Birth Cohort 1966, a variant on chromosome X with a MAF of 
0.017 was identified as associated with markedly increased LDL cholesterol levels in the 38 males that carried it. The 
variant is located in an intron of AR — a ligand-dependent transcription factor controlling circulating androgen levels, 
which are in part responsible for sex-specific dyslipidaemias. It is likely that this variant was present in a founding 
member of the Finnish population and was inherited in the 38 males by descent. 

Rare variant identification and association testing in sequencing studies 

The discovery and association testing of rare variants with the propensity to develop colorectal adenomas has been 
performed through a candidate gene sequencing study 103 . The investigators analysed 124 UK patients with multiple polyps 
and 483 random controls for germ line variants in five genes, three that are involved in the Wnt signalling pathway (APC, 
AXIN1 and CTNNB1) and two in mismatch repair (MLH1 and MSH2) by DNA sequencing. Overall, 24% of the individuals 
with adenoma had a rare potentially pathogenic variant in one of the five genes compared with 11.5% of the controls. The 
rare variants aggregated as a class and their combined frequency differences between cases and controls are significantly 
different with an odds ratio of 2.2. Interestingly, several of the rare variants identified in this study with odds ratios of ~2.0 
were shown by examining surrounding sequences to have a common origin and thus, in effect, be founder alleles. 
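
Aggregating rare variants 'as a class', as in the study above, amounts to collapsing each individual's rare alleles into a single carrier status and comparing carrier frequencies between cases and controls. The following is a minimal sketch of such a collapsing test using a one-sided Fisher's exact test written with the standard library; the function names and the genotype data are hypothetical and are not the data or the exact analysis of the cited study.

```python
from math import comb


def count_carriers(genotypes):
    """Count individuals carrying at least one rare allele in any tested gene.

    genotypes: one list per individual, holding rare-allele counts per site.
    """
    return sum(1 for person in genotypes if any(count > 0 for count in person))


def fisher_exact_one_sided(case_carriers, case_noncarriers, ctrl_carriers, ctrl_noncarriers):
    """P(at least this many carriers among cases) under the hypergeometric null."""
    n_cases = case_carriers + case_noncarriers
    n_carriers = case_carriers + ctrl_carriers
    n_total = n_cases + ctrl_carriers + ctrl_noncarriers
    max_case_carriers = min(n_cases, n_carriers)
    favourable = sum(
        comb(n_carriers, k) * comb(n_total - n_carriers, n_cases - k)
        for k in range(case_carriers, max_case_carriers + 1)
    )
    return favourable / comb(n_total, n_cases)


# Hypothetical genotypes at three rare sites for four cases and four controls
cases = [[0, 1, 0], [1, 0, 0], [0, 0, 2], [0, 0, 0]]
controls = [[0, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0]]
case_carriers = count_carriers(cases)
ctrl_carriers = count_carriers(controls)
p_value = fisher_exact_one_sided(
    case_carriers, len(cases) - case_carriers,
    ctrl_carriers, len(controls) - ctrl_carriers,
)
```

Collapsing treats every rare variant as equally likely to be functional, so in practice the choice of which variants to include (for example, only those predicted to alter protein function) strongly affects the result.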

Rare variant characteristics 

These studies demonstrate several important characteristics of rare variants. First, rare variants will often have a 
different population history to common variants. Common variants are ancient and are frequently present in all 
human populations (BOX 2), whereas rare variants are likely to be population specific 51 , having originated from founder 
effects 10 to 20 generations ago. Second, rare variants that are associated with complex phenotypes are likely to 
have effect sizes larger than those of common variants. The penetrance of rare variants will vary and in some cases be 
high — such as APOC3 R19X, in which all carriers have a favourable lipid profile 119 — and in other cases the 
penetrance will be considerably lower — such as those found underlying colorectal adenomas 103 . Finally, as 
technological advances allow rare variants to be directly assayed, some individual alleles will be significant when 
tested for association with a complex trait 119,120 . In other cases, rare variants will have to be aggregated as a class and 
compared between cases and controls 103 — this will always be the case for novel variants. Similar studies examining 
the roles of rare and novel variants in triglyceride and cholesterol serum levels overall support these conclusions 104-106 . 
knowledge, what do we predict that the future will tell us? 
Given the low effect sizes of common variants identified 
through GWA studies and the fact that a surprising frac- 
tion are associated with multiple diseases, it will probably 
be shown that they do not account for familial concentra- 
tion of phenotypic traits but rather that they modify the 
penetrance of causal rare variants with large effect sizes. 

The impact of human genetics on medicine. The knowl- 
edge gained through human genetic studies will have 
a major impact on medical sciences. In the short term 
our increased understanding of the molecular pathways 
involved in disease provides new potential drug targets. 
In the long term the ability to predict disease susceptibil- 
ity, as well as classify diseases into subphenotypes from 
genotypic information, will result in improved treatment 



and an expanded use of pharmacogenetics. The ability to 
stratify individuals according to genotype has the poten- 
tial to make clinical trials more cost-effective and time- 
efficient by enrolling a much smaller number of patients 
with an anticipated larger treatment effect when the 
intervention is more precisely matched with the under- 
lying altered biology. The majority of existing cohorts 
have been collected for case-control study designs and 
therefore can only provide a snapshot assessment of the 
association of a genetic variant and a particular trait. 
However, the natural progression of a disease cannot 
be adequately probed through such studies. Therefore, 
we call for the collection and analysis of carefully phe- 
notyped prospective cohorts, which will be essential to 
develop accurate risk and disease course prediction from 
genotypic information. 